In this project we use a dataset of wine reviews to predict review points from numerical, categorical and textual predictors.
The data is from Kaggle Datasets, and covers 150k wine reviews along with some attributes of the wines. It can be found here. (A (free) Kaggle login is required to access it directly from kaggle.com). The data was originally scraped from WineEnthusiast.
The dataset contains the following columns:
Points: the number of points the wine received in its review, on a scale of 1-100. However, only wines with >= 80 points were published in the dataset. We plan to transform this feature and check out logit, probit, BoxCox and log transforms. This will be our response variable.
Description: a description of the wine’s taste, smell, look, feel, etc. We plan to use LDA topic modeling to convert the text description into a small vector of numerical predictors. We plan to generate a “sentiment” numeric predictor using a sentiment classifier. This is an additive predictor.
Price: the cost for a bottle of the wine. We will transform this using log or inverse or a power of the inverse (TBD by CV), or any combination of these.
Variety: the type of grapes used
Country: the country that the wine is from
This is a particularly interesting problem for several reasons:
Feature Engineering
The sentiment of the description of the wine was computed using bing. Sentiment was added as a numeric predictor, named “sentiment”, ranging from -6 to 11.
Because price ranges from 4 to 2300 USD, and is positive, ranging over 3 orders of magnitude, log(price) is used as a predictor, instead of price.
## [1] "points" "price" "continentTopFive"
## [4] "topic1" "sentiment"
Training and Test Set Generation
We create an 80% training, 20% test split for later use in cross validation.
Model Selection
Model #1
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 2824.7, df = 20, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod1), 5000)
## W = 0.91738, p-value < 2.2e-16
Model 2
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 5604, df = 21, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod2), 5000)
## W = 0.99597, p-value = 1.765e-10
Model 3
##
## studentized Breusch-Pagan test
##
## data: mod3
## BP = 4521.8, df = 21, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod3), 5000)
## W = 0.99349, p-value = 2.522e-14
Model 4
##
## studentized Breusch-Pagan test
##
## data: mod4
## BP = 4464.5, df = 21, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod4), 5000)
## W = 0.99624, p-value = 5.742e-10
rmse_data
## CV RMSE
## model1 2.561139
## model2 2.433134
## model3 2.450683
## model4 2.532468
The CV RMSE can be used to construct prediction intervals, in lieu of analytic formulae based on normal and homoskedastic errors, as \(\hat{y} \pm 2*2.4331\) for the prediction interval.
Using Model 2 with this prediction interval resulted in a success ratio of 0.9434985 on the withheld test set (i.e. this percentage of the withheld test set points were within the prediction interval of Model #2).
We used four different strategies to attempt to meet the assumptions of multiple linear regression and minimize cross-validated RMSE. Attempts include transforming response with BoxCox and log, integrated polynomial models over the engineered features, and weighted least squares regression. In the end, the non-transformed response model had the lowest cross-validated RMSE with a value of 2.4331344.
Our coefficient results are as follows:
coef(summary(mod2))
## Estimate Std. Error
## (Intercept) 70.73392019 4.4377394
## log(price) 6.45145661 0.6792085
## continentTopFiveAsia -16.50523848 10.2899555
## continentTopFiveAustralia -11.72544870 4.4348757
## continentTopFiveEuropeTop5 -3.96254922 4.0460673
## continentTopFiveNorthAmerica 6.07493460 4.0362483
## continentTopFiveOtherEurope -0.28473634 4.9468251
## continentTopFiveSouthAmerica -14.87145016 4.2551687
## topic1 16.10546421 8.8309836
## log(price):continentTopFiveAsia -1.00496932 0.3779971
## log(price):continentTopFiveAustralia 0.12886966 0.1699013
## log(price):continentTopFiveEuropeTop5 -0.03923637 0.1573030
## log(price):continentTopFiveNorthAmerica 0.29455211 0.1576415
## log(price):continentTopFiveOtherEurope 0.10960308 0.1859945
## log(price):continentTopFiveSouthAmerica 0.39977604 0.1655958
## log(price):topic1 -6.98679668 1.3173636
## continentTopFiveAsia:topic1 36.34992976 20.0892397
## continentTopFiveAustralia:topic1 22.46995770 8.8227125
## continentTopFiveEuropeTop5:topic1 7.70128325 8.0496260
## continentTopFiveNorthAmerica:topic1 -15.87112737 8.0286509
## continentTopFiveOtherEurope:topic1 0.04715288 9.8293135
## continentTopFiveSouthAmerica:topic1 25.96146830 8.4714437
## t value Pr(>|t|)
## (Intercept) 15.93917850 4.043829e-57
## log(price) 9.49849174 2.178339e-21
## continentTopFiveAsia -1.60401456 1.087143e-01
## continentTopFiveAustralia -2.64391824 8.196639e-03
## continentTopFiveEuropeTop5 -0.97935820 3.274056e-01
## continentTopFiveNorthAmerica 1.50509438 1.323033e-01
## continentTopFiveOtherEurope -0.05755941 9.540997e-01
## continentTopFiveSouthAmerica -3.49491435 4.744376e-04
## topic1 1.82374523 6.819385e-02
## log(price):continentTopFiveAsia -2.65866962 7.846347e-03
## log(price):continentTopFiveAustralia 0.75849736 4.481552e-01
## log(price):continentTopFiveEuropeTop5 -0.24943173 8.030274e-01
## log(price):continentTopFiveNorthAmerica 1.86849366 6.169648e-02
## log(price):continentTopFiveOtherEurope 0.58928117 5.556741e-01
## log(price):continentTopFiveSouthAmerica 2.41416835 1.577312e-02
## log(price):topic1 -5.30362053 1.137905e-07
## continentTopFiveAsia:topic1 1.80942287 7.038861e-02
## continentTopFiveAustralia:topic1 2.54683099 1.087223e-02
## continentTopFiveEuropeTop5:topic1 0.95672559 3.387083e-01
## continentTopFiveNorthAmerica:topic1 -1.97681126 4.806596e-02
## continentTopFiveOtherEurope:topic1 0.00479717 9.961724e-01
## continentTopFiveSouthAmerica:topic1 3.06458606 2.180347e-03
Normality of the errors and homoskedasticity are suspect, as demonstrated by the Breusch-Pagan and Shapiro-Wilk tests. We see that the variance seems highest for points around 90, and tapers to the extremes. However, because Shapiro-Wilk in R can take only a maximum of 5000 residuals, sampling is required, and this test is then dependent on the choice of random seed used in the sampling.
We get around the normal and homoskedastic assumptions by using cross-validated RMSE to create our prediction intervals. Doing so gives a 94% success rate on the withheld test set with a +/- 4.8 point prediction interval.
Potential for future improvements: This project could potentially be further expanded in the future by using deeper NLP, expanding with additional predictors, and expanded model selection attempts. It may also be the case that the data is noisy, with a high base \(\epsilon\) variance of points no matter which predictors are used. (subjective human judgement after all).
About the LDA